Abstract. We study the link structure of on-line social networks (OSNs), and
introduce a new model for such networks which may help infer their hidden underlying
reality. In the geo-protean (GEO-P) model for OSNs nodes are identified with points 
in Euclidean space, and edges are stochastically generated by a mixture of the rel- 
ative distance of nodes and a ranking function. With high probability, the GEO-P 
model generates graphs satisfying many observed properties of OSNs, such as power 
law degree distributions, the small world property, densification power law, and bad 
spectral expansion. We introduce the dimension of an OSN based on our model, and 
examine this new parameter using actual OSN data. We discuss how the geo-protean 
model may eventually be used as a tool to group users with similar attributes using 
only the link structure of the network. 

1. Introduction 

On-line social networking sites such as Facebook, Flickr, LinkedIn, MySpace, and
Twitter are examples of large-scale, complex, real-world networks, with an estimated
total number of users that equals at least half of all Internet users [2]. We may model an
OSN by a graph with nodes representing users and edges corresponding to friendship 
links. While OSNs gain increasing popularity among the general public, there is a 
parallel increase in interest in the cataloguing and modelling of their structure, function, 
and evolution. OSNs supply a vast and historically unprecedented record of large-scale 
human social interactions over time. 

The availability of large-scale social network data has led to numerous studies that 
revealed emergent topological properties of OSNs. For example, the recent study [19] 
crawled the entire Twitter site, and studied properties found among the 41.7 million 
user profiles and 1.47 billion social relations. The next challenge is the design and 
rigorous analysis of models simulating these properties. Graph models were successful 
in simulating properties of other complex networks such as the web graph (see the 
books lU [H] for surveys of such models), and it is thus natural to propose models for 
OSNs. Few rigorous models for OSNs have been posed and analyzed, and there is no 
universal consensus of which properties such models should simulate. Notable recent 
models are those of Kumar et al. [18], Lattanzi and Sivakumar [20], and the Iterated
Local Transitivity model [5].

Researchers are now in the enviable position of observing how OSNs evolve over
time, and as such, network analysis and models of OSNs typically incorporate time as 
a parameter. While by no means exhaustive, some of the main observed properties of 
OSNs include the following. 

(i) Large-scale. OSNs are examples of complex networks with the number of nodes (which
we write as n) often in the millions; further, some users have disproportionately high 
degrees. For example, some of the nodes of Twitter corresponding to well-known celebri- 
ties have degree over five million. 

(ii) Small world property and shrinking distances. The small world property, intro- 
duced by Watts and Strogatz [2^, is a central notion in the study of complex networks 
(see also [I?]). The small world property demands a low diameter of $O(\log n)$, and
a higher clustering coefficient than found in a binomial random graph with the same 
number of nodes and same average degree. Adamic et al. [1] provided an early study of 
an OSN at Stanford University, and found that the network has the small world property.
Similar results were found in [2], which studied Cyworld, MySpace, and Orkut,
and in [26] which examined data collected from Flickr, YouTube, LiveJournal, and 
Orkut. Low diameter (of 6) and high clustering coefficient were reported in
Twitter by both Java et al. and Kwak et al. [16, 19]. Kumar et al. [18] reported that in
Flickr and Yahoo!360 the diameter actually decreases over time. Similar results were
reported for Cyworld in [2]. Well-known models for complex networks such as preferential
attachment or copying models have logarithmically growing diameters with time.
Various models (see [21, 22]) were proposed simulating power law degree distributions
and decreasing distances. 

(iii) Densification power law. A graph $G$ with $e_t$ edges and $n_t$ nodes satisfies a
densification power law if there is a constant $a$ in $(1, 2)$ such that $e_t$ is proportional to
$n_t^a$. In particular, the average degree grows to infinity with the order of the network.
In [21], densification power laws were reported in several real-world networks such as
the physics citation graph and the internet graph at the level of autonomous systems. 
Densification was reported in Cyworld [2] and has been detected in other OSNs. 

(iv) Power law degree distributions. In a graph $G$ of order $n$, let $N_k = N_k(n)$ be the
number of nodes of degree $k$. The degree distribution of $G$ follows a power law if $N_k$
is proportional to $k^{-b}$, for a fixed exponent $b > 2$. Power laws were observed over a
decade ago in subgraphs sampled from the web graph, and are ubiquitous properties of 
complex networks (see Chapter 2 of [1]). Kumar, Novak, and Tomkins [18] studied the 
evolution of Flickr and Yahoo!360, and found that these networks exhibit power-law 
degree distributions. Power law degree distributions for both the in- and out-degree 
distributions were documented in Flickr, YouTube, LiveJournal, and Orkut [26], as well
as in Twitter [16, 19].

(v) Bad spectral expansion. Social networks often organize into separate clusters in
which the number of intra-cluster links is significantly higher than the number of inter-cluster
links. In particular, social networks contain communities (characteristic of social orga- 
nization), where tightly knit groups correspond to the clusters [27]. As a result, it is 
reported in [10] that social networks, unlike other complex networks, possess bad spectral
expansion properties realized by small gaps between the first and second eigenvalues
of their adjacency matrices. 

Our main contributions in the present work are twofold: first, to provide a model, the
geo-protean (GEO-P) model, which provably satisfies all six properties above (see
Section 3; note that, while the model does not generate graphs with shrinking distances,
the parameters can be adjusted to give constant diameter), and second, to suggest
a reverse engineering approach to OSNs. Given only the link structure of OSNs, we 
ask whether it is possible to infer the hidden reality of such networks. Can we group 
users with similar attributes from only the link structure? For instance, a reasonable 
assumption is that out of the millions of users on a typical OSN, if we could assign the 
users various attributes such as age, sex, religion, geography, and so on, then we should 
be able to identify individuals or at least small sets of users by their set of attributes. 
Thus, if we can infer a set of identifying attributes for each node from the link structure, 
then we can use this information to recognize communities and understand connections
between users. 

Characterizing users by a set of attributes leads naturally to a vector-based or geo- 
metric approach to OSNs. In geometric graph models, nodes are identified with points 
in a metric space, and edges are introduced by probabilistic rules that depend on the 
proximity of the nodes in the space. We envision OSNs as embedded in a social space, 
whose dimensions quantify user traits such as interests or geography; for instance, nodes 
representing users from the same city or in the same profession would likely be closer 
in social space. A first step in this direction was given in [2^, which introduced a 
rank-based model in an m-dimensional grid for social networks (see also the notion of 
social distance provided in [28]). Such an approach was taken in geometric preferential 
attachment models of Flaxman et al. [11], and in the SPA geometric model for the web
graph [3]. 

The geo-protean model incorporates a geometric view of OSNs, and also exploits 
ranking to determine the link structure. Higher ranked nodes are more likely to receive 
links. A formal description of the model is given in Section 2. Results on the model
are summarized in Section 3. We present a novel approach to OSNs by assigning
them a dimension; see formula (4). Given certain OSN statistics (order, power law
exponent, average degree, and diameter), we can assign each OSN a dimension based
on our model. The dimension of an OSN may be roughly defined as the least integer $m$
such that we can accurately embed the OSN in $m$-dimensional Euclidean space. Full
proofs of our results are presented in Section 4. In the final discussion, we summarize
our findings and conjecture on the correct diameter for OSNs. 

2. The GEO-P Model for OSNs 

We now present our model for OSNs, which is based on both the notions of embedding 
the nodes in a metric space (geometric), and a link probability based on a ranking of 
the nodes (protean). We identify the users of an OSN with points in m-dimensional 
Euclidean space. Each node has a region of influence, and nodes may be joined with a 
certain probability if they land within each other's region of influence. Nodes are ranked
by their popularity from 1 to n, where n is the number of nodes, and 1 is the highest
ranked node. Nodes that are ranked higher have larger regions of influence, and so 
are more likely to acquire links over time. For simplicity, we consider only undirected 
graphs. The number of nodes n is fixed but the model is dynamic: at each time-step, 
a node is born and one dies. A static number of nodes is more representative of the 
reality of OSNs, as the number of users in an OSN would typically have a maximum 
(an absolute maximum arises from roughly the number of users on the internet, not 
counting multiple accounts). For a discussion of ranking models for complex networks, 
see [12, 13, 15, 25].

We now formally define the GEO-P model. The model produces a sequence $(G_t : t \ge 0)$
of undirected graphs on $n$ nodes, where $t$ denotes time. We write $G_t = (V_t, E_t)$.
There are four parameters: the attachment strength $\alpha \in (0,1)$, the density parameter
$\beta \in (0, 1-\alpha)$, the dimension $m \in \mathbb{N}$, and the link probability $p \in (0,1]$. Each node
$v \in V_t$ has rank $r(v,t) \in [n]$ (we use $[n]$ to denote the set $\{1, 2, \ldots, n\}$). The rank
function $r(\cdot,t) : V_t \to [n]$ is a bijection for all $t$, so every node has a unique rank. The
highest ranked node has rank equal to 1; the lowest ranked node has rank $n$. The
initialization and update of the ranking is done by random initial rank. (Other ranking
schemes may also be used; we use random initial rank for its simplicity.) In particular,
the node added at time $t$ obtains an initial rank $R_t$ which is randomly chosen from
$[n]$ according to a prescribed distribution. Ranks of all nodes are adjusted accordingly.
Formally, for each $v \in V_{t-1}$ that is not deleted at time $t$,

$$r(v,t) = r(v,t-1) + \delta - \gamma,$$

where $\delta = 1$ if $r(v,t-1) > R_t$ and 0 otherwise, and $\gamma = 1$ if the rank of the node
deleted in step $t$ is smaller than $r(v,t-1)$, and 0 otherwise.

Let $S$ be the unit hypercube in $\mathbb{R}^m$, with the torus metric $d(\cdot,\cdot)$ derived from the $L_\infty$
metric. More precisely, for any two points $x$ and $y$ in $\mathbb{R}^m$, their distance is given by

$$d(x,y) = \min\{\|x - y + u\|_\infty : u \in \{-1,0,1\}^m\}.$$

The torus metric thus "wraps around" the boundaries of the unit cube, so every point 
in S is equivalent. The torus metric is chosen so that there are no boundary effects, 
and altering the metric will not significantly affect the main results. 
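
The torus metric can be evaluated coordinate by coordinate. The following is a minimal Python sketch, assuming both points already lie in the unit hypercube $[0,1)^m$; the function name `torus_distance` is ours and is not part of the model description.

```python
def torus_distance(x, y):
    """L-infinity distance between two points of the unit torus [0, 1)^m."""
    if len(x) != len(y):
        raise ValueError("points must have the same dimension")
    # In each coordinate the shift u_i in {-1, 0, 1} can be chosen independently,
    # so the best choice is the shorter way around a circle of length 1.
    return max(min(abs(a - b), 1.0 - abs(a - b)) for a, b in zip(x, y))
```

With this metric no two points are ever more than 1/2 apart, which is the fact used later in the diameter argument.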

To initialize the model, let $G_0 = (V_0, E_0)$ be any graph on $n$ nodes that are chosen
from $S$. We define the influence region of node $v$ at time $t \ge 0$, written $R(v,t)$, to be
the ball around $v$ with volume

$$|R(v,t)| = r(v,t)^{-\alpha} n^{-\beta}.$$

For $t \ge 1$, we form $G_t$ from $G_{t-1}$ according to the following rules.

(i) Add a new node $v$ that is chosen uniformly at random from $S$. Next, independently,
for each node $u \in V_{t-1}$ such that $v \in R(u,t-1)$, an edge $vu$ is created
with probability $p$. Note that the probability that $u$ receives an edge is
proportional to $p\, r(u,t-1)^{-\alpha}$. The negative exponent guarantees that nodes with
higher ranks ($r(u,t-1)$ close to 1) are more likely to receive new edges than
lower ranks.

(ii) Choose uniformly at random a node $u \in V_{t-1}$, delete $u$ and all edges incident to
$u$.

(iii) Vertex $v$ obtains an initial rank $r(v,t) = R_t$ which is randomly chosen from $[n]$
according to a prescribed distribution.

(iv) Update the ranking function $r(\cdot,t) : V_t \to [n]$.
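
Rules (i) to (iv) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the initial rank is drawn uniformly from $[n]$ (the text only requires "a prescribed distribution"), the rank update first closes the gap left by the deleted node and then shifts ranks at or above $R_t$ up by one so that the rank function stays a bijection, and all function and variable names are ours. The influence region is taken to be a ball in the torus metric, so a ball of radius $\rho \le 1/2$ has volume $(2\rho)^m$.

```python
import random

def torus_distance(x, y):
    # L-infinity distance on the unit torus [0, 1)^m.
    return max(min(abs(a - b), 1.0 - abs(a - b)) for a, b in zip(x, y))

def geo_p_init(n, m, rng):
    """Hypothetical starting state: n random points, random ranks, no edges."""
    pos = {i: tuple(rng.random() for _ in range(m)) for i in range(n)}
    ranks = list(range(1, n + 1))
    rng.shuffle(ranks)
    return pos, dict(zip(range(n), ranks)), {i: set() for i in range(n)}

def geo_p_step(pos, rank, adj, alpha, beta, p, rng):
    """Apply one time-step in place; pos, rank, adj are dicts keyed by node id."""
    n = len(pos)
    m = len(next(iter(pos.values())))
    old_nodes = list(pos)
    v = max(old_nodes) + 1
    point = tuple(rng.random() for _ in range(m))
    # (i) v links, with probability p, to every u whose influence region contains v.
    nbrs = set()
    for u in old_nodes:
        volume = rank[u] ** (-alpha) * n ** (-beta)
        radius = 0.5 * volume ** (1.0 / m)
        if torus_distance(point, pos[u]) <= radius and rng.random() < p:
            nbrs.add(u)
    # (ii) delete a uniformly random old node together with its edges.
    dead = rng.choice(old_nodes)
    dead_rank = rank[dead]
    for w in adj[dead]:
        adj[w].discard(dead)
    del pos[dead], rank[dead], adj[dead]
    nbrs.discard(dead)
    # (iii) initial rank of v (assumed uniform on [n]).
    new_rank = rng.randint(1, n)
    # (iv) re-rank the survivors so that ranks remain a bijection onto [n].
    for u in rank:
        if rank[u] > dead_rank:
            rank[u] -= 1
        if rank[u] >= new_rank:
            rank[u] += 1
    pos[v], rank[v], adj[v] = point, new_rank, nbrs
    for u in nbrs:
        adj[u].add(v)
    return v
```

Calling `geo_p_step` repeatedly on the state returned by `geo_p_init` keeps the number of nodes fixed at $n$ while the edge set evolves.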

Since the process is an ergodic Markov chain, it will converge to a stationary
distribution. (See [23] for more on Markov chains.) The random graph corresponding to this
distribution with given parameters $\alpha, \beta, m, p$ is called the geo-protean graph (or GEO-P
model), and is written GEO-P($\alpha, \beta, m, p$). The coupon collector problem can give us
insight into when the stationary state will be reached. Namely, let $L = n(\log n + O(\omega(n)))$
where $\omega(n)$ is any function tending to infinity with $n$. It is a well-known result that,
with probability tending to 1 as $n$ tends to infinity, after $L$ steps all original vertices
will be deleted.
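
The coupon collector heuristic can be made concrete. The sketch below computes the exact expected number of steps until all $n$ initial vertices have been deleted, assuming (as in rule (ii)) that each step deletes one of the $n$ current vertices uniformly at random; the function name is ours.

```python
import math
from fractions import Fraction

def expected_steps_until_originals_gone(n):
    """Expected number of steps until all n initial vertices are deleted."""
    # While k original vertices remain, a step deletes one of them with
    # probability k/n, so the waiting time is geometric with mean n/k.
    return sum(Fraction(n, k) for k in range(1, n + 1))
```

The result is $n H_n$, where $H_n$ is the $n$-th harmonic number, which grows like $n \log n$ in agreement with the choice of $L$.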

See Figure 1 for a simulation of the model in the unit square.

Figure 1. A simulation of the GEO-P model, with $n = 5{,}000$, $\alpha = 0.7$,
$\beta = 0.15$, $m = 2$, and $p = 0.9$.

3. Results and Dimension 

3.1. Results. We now state the main theoretical results we discovered for the geo- 
protean model, with proofs supplied in the next section. The model generates with 
high probability graphs satisfying each of the properties (i) to (iv) we discussed in the 
introduction. Proofs are presented in Section 4. Throughout, we will use the stronger
notion of wep in favour of the more commonly used aas, since it simplifies some of
our proofs. We say that an event holds with extreme probability (wep), if it holds with
probability at least $1 - \exp(-\Theta(\log^2 n))$ as $n \to \infty$. Thus, if we consider a polynomial
number of events that each holds wep, then wep all events hold.
Let $N_k = N_k(n, p, \alpha, \beta)$ denote the number of nodes of degree $k$, and $N_{\ge k} = \sum_{l \ge k} N_l$.
The following theorem demonstrates that the geo-protean model generates power law
graphs with exponent

$$b = 1 + 1/\alpha. \qquad (1)$$

Note that the variables $N_{\ge k}$ represent the cumulative degree distribution, so the degree
distribution of these variables has power law exponent $1/\alpha$.

Theorem 3.1. Let $\alpha \in (0,1)$, $\beta \in (0,1-\alpha)$, $m \in \mathbb{N}$, $p \in (0,1]$, and

$$n^{1-\alpha-\beta} \log^{1/2} n \le k \le n^{1-\alpha/2-\beta} \log^{-2\alpha-1} n.$$

Then wep GEO-P($\alpha, \beta, m, p$) satisfies

$$N_{\ge k} = (1 + O(\log^{-1/3} n))\, \frac{\alpha}{\alpha+1}\, p^{1/\alpha}\, n^{(1-\beta)/\alpha}\, k^{-1/\alpha}.$$

Our next result shows that geo-protean graphs are relatively dense. For a graph
$G = (V, E)$ of order $n$, define the average degree of $G$ by $d = 2|E|/n$.

Theorem 3.2. Wep the average degree of GEO-P($\alpha, \beta, m, p$) is

$$d = (1 + o(1))\, \frac{p}{1-\alpha}\, n^{1-\alpha-\beta}. \qquad (2)$$

Note that the average degree tends to infinity with n; that is, the model generates 
graphs satisfying a densification power law. In [21], densification power laws were 
reported in several real-world networks such as the physics citation graph and the 
internet graph at the level of autonomous systems. 

Our next result describes the diameter of graphs sampled from the GEO-P model. 
While the diameter is not shrinking, it can be made constant by allowing the dimension 
to grow as a logarithmic function of n. 

Theorem 3.3. Let $\alpha \in (0,1)$, $\beta \in (0,1-\alpha)$, $m \in \mathbb{N}$, and $p \in (0,1]$. Then wep the
diameter $D$ of GEO-P($\alpha, \beta, m, p$) satisfies

D = log "1 n), and D = 0(n^^-°''>"^ log(i-")"' n). (3) 

In particular, wep the order of the diameter can be expressed as:

$$\log D = \frac{\beta}{(1-\alpha)m} \log n + O\left(\frac{\log\log n}{m}\right).$$

We note that in a geometric model where regions of influence have constant volume
and possessing the same average degree as the geo-protean model, the diameter is
$\Omega(n^{(\alpha+\beta)/m})$. This is a larger diameter than in the GEO-P model. If $m = C \log n$, for some
constant $C > 0$, then wep we obtain a diameter bounded above by a constant.



Let $G = (V, E)$ be a graph. For sets of vertices $X, Y \subseteq V$, define $e(X,Y)$ to be
the set of edges with one endpoint in $X$ and the other in $Y$. For simplicity, we write
$e(X) = e(X, X)$. Let $N(v)$ be the neighbour set of the vertex $v$. The clustering coefficient
of vertex $v \in V$ is defined as follows:

$$c(v) = \frac{|e(N(v))|}{\binom{\deg(v)}{2}}.$$

(Note that, formally, we need to assume that $\deg(v) \ge 2$ above, but this is aas the case
in our model. One can define $c(v) = 0$ when $\deg(v) \le 1$.) In other words, $c(v) \in [0,1]$
is the probability that two different neighbours of $v$, chosen uniformly at random, are
adjacent. In the random graph $G(n,p)$, the expected value of $c(v)$ is $p$ for any vertex
$v$. The clustering coefficient of $G$ is defined as

$$c(G) = \frac{1}{|V|} \sum_{v \in V} c(v).$$

Hence, $\mathbb{E}(c(G(n,p))) = p$. We prove that wep the GEO-P model, for some values of $m$,
generates graphs with higher clustering coefficient than in a random graph $G(n, d/n)$
with the same expected average degree. It follows from Theorem 3.2 that

$$d = (1 + o(1))\, \frac{p}{1-\alpha}\, n^{1-\alpha-\beta}.$$
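
The clustering coefficient is straightforward to compute from an adjacency structure. The sketch below assumes the graph is given as a dictionary mapping each vertex to the set of its neighbours (a representation chosen here for illustration) and takes $c(G)$ to be the average of $c(v)$ over all vertices, with $c(v) = 0$ for vertices of degree at most 1.

```python
from itertools import combinations

def clustering_coefficient(adj):
    """Average over all vertices v of c(v); adj maps vertex -> set of neighbours."""
    total = 0.0
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # c(v) = 0 by convention when deg(v) <= 1
        # Count the edges among the neighbours of v.
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += links / (k * (k - 1) / 2)
    return total / len(adj)
```

For a triangle with one pendant vertex attached, the four local values are 1/3, 1, 1 and 0, so the function returns 7/12.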



We use the notation $\lfloor x \rfloor_2$ to denote the largest even integer smaller than or equal to $x$.

Theorem 3.4. Wep the clustering coefficient of G sampled from GEO-P{a, (3,m,p) 
satisfies the following inequality 

2 XX"* n-a 

3K 



c(G) > (1 + 0(1)) 

= (l + o(l))exp 



1 



-/ 



1 + a 
1 



p 



a 



a 



where /(f) = e(f ), and 



K 



n 



log n 



Note that if 
m < (1 



a 



(3) 



logn 
log log n 



1 



log log n 



(l + o(l))(l-a-/3) 



logn 
log log n ' 



then K ^ m, and the clustering coefficient of GEO-P {a, P,m,p) is wep at least 



:i+o(i)) 



a 



1 + a 



p = n"^'^ > (l + o(l)): 



— a 



c{G{n, d/n)). 



Hence, the clustering coefficient is larger than that of a comparable random graph. 

If m = o(logn) but large enough so that the condition K ^ m does not hold, a 
similar result holds but the error term is not (1 + o(l)) anymore. In this case, wep, 

m+o{m,) 

= n"(i) :$> c{Gin,d/n)). 



c{G) > ( ^ 
Finally, if

$$m = (1 + o(1))\, b\, (1 - \alpha - \beta) \log n$$

for some constant $b \in (0,1)$, then $K \ge \lfloor e^{1/b} \rfloor_2 \ge e^{1/b} - 2$, and so wep

= exp (b{l -a-P)\og(J- ^^-ji-^ j (1 + 0(1)) log nV 

For the random graph counterpart we have that wep

$$c(G(n, d/n)) = (1 + o(1))\, \frac{p}{1-\alpha}\, n^{-\alpha-\beta} = \exp\big((-\alpha - \beta + o(1)) \log n\big).$$

Thus, we get larger clustering coefficient for $b$ small enough; that is, for $b < b_0$, where
$b_0 = b_0(\alpha, \beta)$ satisfies the following equation

Kl-a-P)\og (i- 2(ei/L2) ) =-"-^- 

(Note that the function on the left hand side is increasing and tends to 0 as $b \to 0$.)
For $b > b_0$ we get the opposite behaviour; that is, the clustering coefficient is smaller
compared to the binomial random graph counterpart.

The normalized Laplacian of a graph relates to important graph properties; see [S]. 
Let A denote the adjacency matrix and D denote the diagonal degree matrix of a 
graph $G$. Then the normalized Laplacian of $G$ is $\mathcal{L} = I - D^{-1/2} A D^{-1/2}$. Let
$0 = \lambda_0 \le \lambda_1 \le \cdots \le \lambda_{n-1} \le 2$ denote the eigenvalues of $\mathcal{L}$. The spectral gap of the normalized
Laplacian is

$$\lambda = \max\{|\lambda_1 - 1|, |\lambda_{n-1} - 1|\}.$$

A spectral gap bounded away from zero is an indication of bad expansion properties. 
Bad expansion is characteristic for OSNs: see the bad spectral expansion property in the
introduction. The next theorem represents a drastic departure from the good expansion found in binomial
random graphs, where $\lambda = o(1)$ [HI |9].

Theorem 3.5. Let $\alpha \in (0,1)$, $\beta \in (0,1-\alpha)$, $m \in \mathbb{N}$, and $p \in (0,1]$. Let $\lambda(n)$ be the
spectral gap of the normalized Laplacian of GEO-P($\alpha, \beta, m, p$). Then wep

(i) If $m = m(n) = o(\log n)$, then $\lambda(n) = 1 + o(1)$.

(ii) If $m = m(n) = C \log n$ for some $C > 0$, then

$$\lambda(n) \ge 1 - \exp\left(-\frac{\alpha+\beta}{C}\right).$$

3.2. Dimension of OSNs. Given an OSN, we describe how we may estimate the
corresponding dimension parameter $m$ if we assume the GEO-P model. In particular,
if we know the order $n$, power law exponent $b$, average degree $d$, and diameter $D$ of
an OSN, then we can calculate $m$ using our theoretical results. Formula (1) gives an
estimate for $\alpha$ based on the power law exponent $b$. If $d^* = \log d / \log n$, then equation
(2) implies that, asymptotically, $1 - \alpha - \beta = d^*$. If $D^* = \log D / \log n$, then formula (3)
about the diameter implies that, asymptotically, $D^* = \frac{\beta}{(1-\alpha)m}$. Thus, an estimate for
$m$ is given by:

$$m = \frac{1}{D^*} \left( 1 - \frac{b-1}{b-2}\, d^* \right). \qquad (4)$$

This estimate suggests that the dimension is proportional to $\log n / \log D$. If $D$ is
constant, this means that $m$ grows logarithmically with $n$. Recall that the dimension
of an OSN may be roughly defined as the least integer $m$ such that we can accurately
embed the OSN in $m$-dimensional Euclidean space. Based on our model we conjecture
that the dimension of an OSN is best fit by approximately $\log n$.

The parameters b, d, and D have been determined for samples from OSNs in various 
studies such as [21 HSl [ISl [22] • The following chart summarizes this data and gives 
the predicted dimension for each network. We round m up to the nearest integer. 
Estimates of the total number of users $n$ for Cyworld, Flickr, and Twitter come from
Wikipedia [30], and those for YouTube come from their website [31]. When the data
consisted of directed graphs, we took $b$ to be the power law exponent for the in-degree
distribution. As noted in [2], the power law exponent of $b = 5$ for Cyworld holds only
for users whose degree is at most approximately 100. When taking a sample, we assume 
that some of the neighbours of each node will be missing. Hence, when computing d*, 
we used n equalling the number of users in the sample. As we assume that the diameter 
of the OSN is constant, we compute D* with n equalling the total number of users. 



Parameter   OSN
            Cyworld      Flickr       Twitter      YouTube
n           2.4 × 10^7   3.2 × 10^7   7.5 × 10^7   3 × 10^8
b           5            2.78         2.4          2.99
d*          0.22         0.17         0.17         0.1
D*          0.11         0.19         0.1          0.16
m           7            4            5            6
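
The dimension estimate can be reproduced numerically. The sketch below chains the three asymptotic relations used in this section: $\alpha = 1/(b-1)$ from the power law exponent, $\beta = 1 - \alpha - d^*$ from the average degree, and $m = \beta / ((1-\alpha) D^*)$ from the diameter. The function name and the sample inputs (the rounded table entries) are ours, so small deviations from the tabulated values are to be expected.

```python
import math

def estimate_dimension(b, d_star, D_star):
    """Estimate m from the power law exponent b, d* = log d / log n and D* = log D / log n."""
    alpha = 1.0 / (b - 1.0)                   # from b = 1 + 1/alpha
    beta = 1.0 - alpha - d_star               # from 1 - alpha - beta = d*
    return beta / ((1.0 - alpha) * D_star)    # from D* = beta / ((1 - alpha) m)

for name, b, d_star, D_star in [("Cyworld", 5, 0.22, 0.11),
                                ("Flickr", 2.78, 0.17, 0.19),
                                ("Twitter", 2.4, 0.17, 0.1)]:
    print(name, math.ceil(estimate_dimension(b, d_star, D_star)))
```

Rounding up gives 7, 4 and 5 for these three networks, matching the table. With the rounded YouTube entries the same formula evaluates to just under 5, so the tabulated value of 6 presumably reflects unrounded inputs.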



4. Proofs of results 

We will make frequent use of the following standard result about the sum of independent
random variables, known as the Chernoff bound; for a proof see Theorem 2.8
in [H].

Theorem 4.1. Let $X$ be a random variable that can be expressed as a sum $X = \sum_{i=1}^{n} X_i$
of independent random indicator variables where $X_i \in \mathrm{Be}(p_i)$ with (possibly) different
$p_i = \mathbb{P}(X_i = 1) = \mathbb{E} X_i$. Then the following holds for $t \ge 0$:

$$\mathbb{P}(X \ge \mathbb{E}X + t) \le \exp\left( -\frac{t^2}{2(\mathbb{E}X + t/3)} \right).$$

In particular, if $\varepsilon \le 3/2$, then

$$\mathbb{P}(|X - \mathbb{E}X| \ge \varepsilon\, \mathbb{E}X) \le 2 \exp\left( -\frac{\varepsilon^2\, \mathbb{E}X}{3} \right).$$

Moreover, if $\mathbb{E}X < \log^2 n$, then wep $X = O(\log^2 n)$.
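
As a sanity check, the upper tail bound can be compared with an exact binomial tail. The sketch assumes the standard form of the bound, $\exp(-t^2 / (2(\mathbb{E}X + t/3)))$, and uses a binomial variable as one particular sum of independent indicators; both function names are ours.

```python
import math

def chernoff_upper_tail(mean, t):
    """Upper bound on P(X >= EX + t) for a sum of independent indicators."""
    return math.exp(-t * t / (2.0 * (mean + t / 3.0)))

def binomial_upper_tail(n, p, k):
    """Exact value of P(Bin(n, p) >= k)."""
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

exact = binomial_upper_tail(100, 0.3, 45)   # EX = 30, t = 15
bound = chernoff_upper_tail(100 * 0.3, 15)
print(exact <= bound)
```

The bound is far from tight for a single event, but it decays fast enough that a union bound over polynomially many events still holds, which is how it is used in the proofs below.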

Before we prove the theorems discussed in the previous sections, we first give a lemma
that shows that, if the initial rank is large enough, the rank of a vertex maintains a
value close to its initial value until its death. The first lemma is proved in [15].

Lemma 4.2 ([15]). Suppose that vertex $v$ obtained an initial rank $R \ge \sqrt{n} \log^2 n$ in
the ranking by random initial rank scheme. Then wep

$$r(v,t) = R(1 + O(\log^{-1/2} n))$$

to the end of its life.

As we will see, the degree of a vertex depends on its rank and its age. The measure
used to quantify the age of a vertex is its age rank. The age rank $a(v,t)$ of vertex $v$ at
time $t$ is a number between 1 and $n$ which represents the rank of $v$ if all vertices alive
at time $t$ are ranked according to age, oldest first. So $a(v,t) = 1$ means that $v$ is the
oldest vertex alive at time $t$, while $a(v,t) = n$ implies that $v$ is the youngest.

Lemma 4.3. Suppose that vertex $v$ obtained an initial rank $R \ge \log^2 n$. Moreover, $v$
has age rank $a(v,L) \ge Cn$ for some constant $C \in (0,1)$. Then wep

$$r(v,t) \le R \cdot 3^{2\log(1/C)} = O(R)$$

during the whole process up to time $L$.

Proof. It follows from Theorem 5.5 in [15] that wep the age rank of a vertex $v$ after

$$t \le n \log n/2 - 2n \log\log n$$

steps is $(1 + o(1))\, n \exp(-t/n)$. Since $v$ has age rank at least $Cn$ at time $L$, this implies
that $C \le (1 + o(1)) \exp(-t_v/n)$, where $t_v$ is the age of $v$ at time $L$ (so $v$ was born at
time $L - t_v$). Thus, $t_v \le (1 + o(1))\, n \log(1/C)$. Recall that the rank of $v$ at the time
it is born equals $R$; we wish to find an upper bound on $r(v,L)$ by considering how the
rank can change in $t_v$ steps of the process. Consider the following random variable $X_t$:
$X_0 = R$, $X_{t+1} = X_t + 1$ with probability $X_t/n$; $X_{t+1} = X_t$, otherwise. The variable
$X_t$ is an upper bound on the rank of $v$ after $t$ steps. Note that this upper bound only
considers changes to the rank due to a vertex of higher rank being inserted, not the
change in rank due to vertices of lower rank being deleted.
We introduce the stopping time:

$$T = \min\{t \ge 0 : X_t > 3R \text{ or } t \ge n/2\}.$$

For any time $t < T$, $X_t$ goes up with probability at most $3R/n$. After $n/2$ steps,
our random variable increases by at most $(1 + o(1))(3R/n)(n/2) = (1 + o(1))\, 3R/2$ wep,
provided that $n/2 \le T$. Hence, wep $T \ge n/2$. Thus, wep in the first $n/2$ steps after
its birth, the rank of vertex $v$ can increase by at most a factor of 3. Dividing the total
life span of $v$ into at most $2\log(1/C)$ blocks of $n/2$ time-steps each, we obtain the
result. □
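
The dominating process used in the proof is easy to simulate. The sketch below runs the chain $X_0 = R$, $X_{t+1} = X_t + 1$ with probability $X_t/n$, for $n/2$ steps; the parameter values and the seed are arbitrary choices for illustration.

```python
import random

def simulate_rank_bound(R, n, steps, rng):
    """Run the chain X_0 = R, X_{t+1} = X_t + 1 with probability X_t / n."""
    x = R
    for _ in range(steps):
        if rng.random() < x / n:
            x += 1
    return x

rng = random.Random(0)
final = simulate_rank_bound(1000, 100_000, 50_000, rng)
print(final / 1000)  # growth factor over one block of n/2 steps
```

In expectation the chain grows by a factor of roughly $e^{1/2} \approx 1.65$ over $n/2$ steps, comfortably below the factor of 3 allowed per block in the proof.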
Proof of Theorems 3.1 and 3.2. The following theorem shows how the degree of a
given vertex depends on its age rank and its initial rank.

Theorem 4.4. Let $\alpha \in (0,1)$, $\beta \in (0,1-\alpha)$, $m \in \mathbb{N}$, $p \in (0,1]$, $i = i(n) \in [n]$. Let
$v_i$ be the vertex in GEO-P($\alpha, \beta, m, p$) whose age rank at time $L$ equals $a(v_i, L) = i$, and
let $R_i$ be the initial rank of $v_i$.
If $R_i \ge \sqrt{n} \log^2 n$, then wep

$$\deg(v_i, L) = (1 + O(\log^{-1/2} n))\, p \left( \frac{i}{(1-\alpha)n} + \left( \frac{R_i}{n} \right)^{-\alpha} \frac{n-i}{n} \right) n^{1-\alpha-\beta}. \qquad (5)$$

Otherwise, that is if $R_i < \sqrt{n} \log^2 n$, wep

$$\deg(v_i, L) \ge (1 + O(\log^{-1/2} n))\, p \left( \frac{i}{(1-\alpha)n} + \left( n^{1/2} \log^{-2} n \right)^{\alpha} \frac{n-i}{n} \right) n^{1-\alpha-\beta}. \qquad (6)$$

Proof. Let $\deg^+(v_i, L)$ denote the number of neighbours of $v_i$ with age rank smaller
than $i$, and define

$$\deg^-(v_i, L) = \deg(v_i, L) - \deg^+(v_i, L).$$

Let us focus on these random variables independently.

Since vertices are distributed uniformly at random, the expected initial degree of $v_i$
(at the time $v_i$ was born) is

$$\sum_{r=1}^{n} p\, r^{-\alpha} n^{-\beta} = p\, n^{-\beta} \int_1^n x^{-\alpha}\, dx + O(1) = \frac{p}{1-\alpha}\, n^{1-\alpha-\beta} + O(1).$$

From the $n - 1$ vertices that were older than $v_i$ at the time it was born, only $i - 1$
remain. Since vertices are deleted uniformly at random, the expected number of older
neighbours remaining equals

$$\mathbb{E} \deg^+(v_i, L) = \frac{p}{1-\alpha} \cdot \frac{i}{n}\, n^{1-\alpha-\beta} + O(1).$$

Since $\deg^+(v_i, L)$ can be expressed as a sum of independent random variables, it follows
from Theorem 4.1 that wep this random variable is well concentrated around its
expectation, provided that $\mathbb{E} \deg^+(v_i, L) = \Omega(\log^3 n)$. This condition holds if, for example,
$i \ge n^{\alpha+\beta} \log^3 n$. In this case we have that $\deg^+(v_i, L) = (1 + O(\log^{-1/2} n))\, \mathbb{E} \deg^+(v_i, L)$.

Next we consider the contribution to the degree of $v_i$ of vertices that are younger
than $v_i$. Suppose first that $R_i \ge \sqrt{n} \log^2 n$. It follows from Lemma 4.2 that wep
$r(v_i, t) = R_i (1 + O(\log^{-1/2} n))$ to the end of its life. Therefore,

$$\mathbb{E} \deg^-(v_i, L) = (1 + O(\log^{-1/2} n))\, p\, R_i^{-\alpha} n^{-\beta} (n - i) = (1 + O(\log^{-1/2} n))\, p \left( \frac{R_i}{n} \right)^{-\alpha} \frac{n-i}{n}\, n^{1-\alpha-\beta}.$$

Similarly as before, this is well concentrated around its expectation. More precisely,
wep

$$\deg^-(v_i, L) = (1 + O(\log^{-1/2} n))\, \mathbb{E} \deg^-(v_i, L),$$

provided that $\mathbb{E} \deg^-(v_i, L) = \Omega(\log^3 n)$. Note that it is sufficient to have
$i \le n - n^{\alpha+\beta} \log^3 n$, even if $R_i = n$.

Let us mention that if one of $\deg^+(v_i, L)$, $\deg^-(v_i, L)$ is $O(\log^3 n)$ in expectation, then
the other is expected to be $\Omega(n^{1-\alpha-\beta})$ and wep is concentrated around its expectation.
This implies that wep their sum $\deg(v_i, L)$ is always concentrated for $R_i \ge \sqrt{n} \log^2 n$.
Combining the number of older and younger neighbours, we obtain that wep (5) holds.

Finally, if $R_i < \sqrt{n} \log^2 n$, then wep $r(v_i, t) \le \sqrt{n} \log^2 n\, (1 + O(\log^{-1/2} n))$ to the end
of its life. The argument for $\deg^+(v_i)$ is analogous and so is omitted, but we only obtain
a lower bound for $\deg^-(v_i)$. Thus, we find that wep (6) holds. □



It follows from Theorem 4.4 that wep the minimum degree is

$$(1 + o(1))\, p\, n^{1-\alpha-\beta}.$$

Vertices of minimum degree are old and of low rank: they have age rank $i = o(n)$ and
thus lost most of their initial links, and their initial rank $R_i = n - o(n)$, so they never
acquired many new links.

The maximum degree changes during the process, and this behaviour cannot be
predicted. However, in order to get an upper bound, suppose the extreme case where
the oldest vertex is ranked number one during its entire life. In such an extreme case,
the degree would be $(1 + o(1))\, p\, n^{1-\beta}$ wep, which indicates that wep the maximum degree
is at most $(1 + o(1))\, p\, n^{1-\beta}$.

We now turn to the average degree. 



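Proof of Theorem 3.2. By Theorem 4.4 the average degree is

$$d = \frac{2|E|}{n} = \frac{2}{n} \sum_{i=1}^{n} \deg^+(v_i, L) = (1 + o(1))\, \frac{2}{n} \sum_{i=1}^{n} \frac{p}{1-\alpha} \cdot \frac{i}{n}\, n^{1-\alpha-\beta} = (1 + o(1))\, \frac{p}{1-\alpha}\, n^{1-\alpha-\beta}. \qquad (7)$$

□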



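The proof of Theorem 3.1 is now a simple consequence of Theorem 4.4. Let $k$ be
such that

$$n^{1-\alpha-\beta} \log^{1/2} n \le k \le n^{1-\alpha/2-\beta} \log^{-2\alpha-1} n.$$

One can show that wep each vertex $v_i$ that has the initial rank $R_i \ge \sqrt{n} \log^2 n$ such
that

$$\frac{R_i}{n} \ge (1 + \log^{-1/3} n) \left( \frac{p\, n^{1-\alpha-\beta}}{k} \cdot \frac{n-i}{n} \right)^{1/\alpha}$$

has fewer than $k$ neighbours, and each vertex $v_i$ for which

$$\frac{R_i}{n} \le (1 - \log^{-1/3} n) \left( \frac{p\, n^{1-\alpha-\beta}}{k} \cdot \frac{n-i}{n} \right)^{1/\alpha}$$

has more than $k$ neighbours.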
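Let $i_0$ be the largest value of $i$ such that

$$\left( \frac{p\, n^{1-\alpha-\beta}}{k} \cdot \frac{n-i}{n} \right)^{1/\alpha} \ge \frac{2 \log^2 n}{\sqrt{n}}.$$

(Note that $i_0 = n - O(n/\log n)$, since $k \le n^{1-\alpha/2-\beta} \log^{-2\alpha-1} n$.) Thus,

$$\begin{aligned}
\mathbb{E} N_{\ge k} &= \sum_{i=1}^{i_0} (1 + O(\log^{-1/3} n)) \left( p\, n^{1-\alpha-\beta} k^{-1}\, \frac{n-i}{n} \right)^{1/\alpha} + O(\sqrt{n} \log^2 n) \\
&= (1 + O(\log^{-1/3} n)) \left( p\, n^{-\alpha-\beta} k^{-1} \right)^{1/\alpha} \sum_{i=1}^{n} (n-i)^{1/\alpha} \\
&= (1 + O(\log^{-1/3} n)) \left( p\, n^{-\alpha-\beta} k^{-1} \right)^{1/\alpha} \frac{n^{1+1/\alpha}}{1 + 1/\alpha} \\
&= (1 + O(\log^{-1/3} n))\, \frac{\alpha}{1+\alpha}\, p^{1/\alpha}\, n^{(1-\beta)/\alpha}\, k^{-1/\alpha},
\end{aligned}$$

and the assertion follows from the Chernoff bound, since $k \le n^{1-\alpha/2-\beta} \log^{-2\alpha-1} n$ and
so $\mathbb{E} N_{\ge k} = \Omega(\sqrt{n} \log^{2+1/\alpha} n)$. □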



Proof of Theorem 3.3. We consider the upper and lower bounds in separate arguments.

Upper bound. The idea of the proof is to show that we can construct a connected
subgraph of vertices spread out over the hypercube with a small diameter. This subgraph
will act as a backbone, and the next step in the proof shows that each vertex is
connected to the backbone by a path of length at most 2.

We will fix values $A \in (0,1)$ and $R \in [n]$ to suit our needs later. To construct the
backbone we partition the hypercube into $1/A$ subcubes, each of volume $A$. Consider
all vertices with initial rank at most $R$ and age rank between $n/4$ and $n/2$; we will call
such vertices eminent vertices. We now choose $A$ and $R$ such that wep (i) the influence
region of each eminent vertex contains the subcube in which it is located as well as all
neighbouring subcubes, and (ii) each subcube contains at least $\log^2 n$ eminent vertices.

Property (i) will be achieved if the sphere of influence of each eminent vertex has size
at least $4^m A$ throughout the process. Note that the sphere of influence of an eminent
vertex has volume at least $R^{-\alpha} n^{-\beta}$ at its birth, and by Lemmas 4.2 and 4.3 wep it remains
at least $3^{-2\alpha \log 4} R^{-\alpha} n^{-\beta} \ge (R/25)^{-\alpha} n^{-\beta}$ to the end of the process. We can thus achieve
(i) by choosing $R$ and $A$ such that the initial influence region is sufficiently larger than
$4^m A$; we choose $5^m A$. This leads to our first condition on $R$ and $A$:

$$(R/25)^{-\alpha} n^{-\beta} = 5^m A. \qquad (8)$$

It follows from the Chernoff bound that, to guarantee that wep every subcube contains
at least $\log^2 n$ eminent vertices, it is sufficient to choose $A$ and $R$ so that the
expected number of eminent vertices in a subcube is at least $\frac{5}{2}\log^2 n$. Since the initial
rank is independent of age, the expected number of eminent vertices in a subcube equals
$(n/4)(R/n)A = (R/4)A$. Thus the following condition on $A$ and $R$ guarantees (ii):

$$\frac{R}{4} A = \frac{5}{2}\log^2 n. \tag{9}$$

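Combining (8) and (9), we find the following values for $R$ and $A$:

$$R = \left(10 \cdot 25^{-\alpha} \cdot 5^{m}\, n^{\beta} \log^2 n\right)^{1/(1-\alpha)},$$

$$\frac{1}{A} = \frac{(10 \cdot 25^{-\alpha})^{1/(1-\alpha)}}{10}\; 5^{\frac{m}{1-\alpha}}\, n^{\frac{\beta}{1-\alpha}} \log^{\frac{2\alpha}{1-\alpha}} n.$$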
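As a sanity check on the algebra, the following sketch (not part of the original argument) evaluates the closed forms for $R$ and $A$ and confirms numerically that they satisfy conditions (8) and (9). The function name `backbone_parameters`, the use of the natural logarithm, and the sample parameter values are illustrative assumptions; rounding $R$ to an integer is ignored.

```python
import math

def backbone_parameters(n, alpha, beta, m):
    """Return (R, A) from the closed forms obtained by combining (8) and (9).

    Hypothetical helper for illustration only; natural logarithm assumed.
    """
    log_sq = math.log(n) ** 2  # log^2 n
    R = (10 * 25 ** (-alpha) * 5 ** m * n ** beta * log_sq) ** (1 / (1 - alpha))
    A = 10 * log_sq / R  # rearrangement of (9): (R/4) * A = (5/2) * log^2 n
    return R, A

# Illustrative parameter values (assumptions, not taken from the paper).
n, alpha, beta, m = 10 ** 6, 0.5, 0.3, 2
R, A = backbone_parameters(n, alpha, beta, m)

lhs8 = (R / 25) ** (-alpha) * n ** (-beta)  # left-hand side of (8)
rhs8 = 5 ** m * A                           # right-hand side of (8)
lhs9 = (R / 4) * A                          # left-hand side of (9)
rhs9 = 2.5 * math.log(n) ** 2               # right-hand side of (9)
```

The two identities are purely algebraic, so they hold for any positive inputs. The requirement $R \in [n]$ is an asymptotic one: because of the constants involved, it is only met once $n$ is very large.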

The set of all eminent vertices forms the backbone. Since the vertices in each subcube
induce a random graph with parameter $p$ (constant, bounded away from zero), wep
the induced subgraph is connected and has diameter two. For any two neighbouring
subcubes, the probability that there is no edge between them is equal to

$$(1-p)^{\log^4 n} = \exp(-\Omega(\log^4 n)),$$

so wep there is at least one edge connecting any two neighbouring subcubes. Since the largest
metric distance in the hypercube with the torus metric equals 1/2, and the subcubes have
diameter $2A^{1/m}$, the diameter of the backbone is wep

$$O\left((1/A)^{1/m}\right) = O\left(n^{\frac{\beta}{(1-\alpha)m}} \log^{\frac{2\alpha}{(1-\alpha)m}} n\right).$$



To finish the proof, we will show that wep every vertex $v$ that is not in the backbone is
within distance two from some vertex in the backbone. Consider any vertex $v$ not part
of the backbone. Consider $S_v$, the ball of volume $n^{-\alpha-\beta}$ centered at $v$. The volume of
$S_v$ is the minimum volume of a sphere of influence, so for each vertex $w$ in $S_v$, edge $vw$
exists with probability $p$. Note also that every vertex of age rank greater than $n/2$ in
any subcube links to the backbone vertex in the same subcube with probability $p$, since
the sphere of influence of the backbone vertex includes all of the subcube. There are
$(1 + o(1))n^{1-\alpha-\beta}/2$ vertices of age rank greater than $n/2$ in $S_v$, and a path of length 2
from $v$ to the backbone through any such vertex exists with probability $p^2$. Thus, wep such a
path exists. □

Lower bound. Note that for $m = \Omega(\log n)$ the theorem states that $D > 1$ and the lower
bound trivially holds. We can assume then that $m = o(\log n)$. Let $R = n^{\beta/(1-\alpha)} \log^{-1} n$.
Note that at every point of the process, the union of influence regions of vertices with
rank at most $R$ has volume at most

$$\sum_{r=1}^{R} r^{-\alpha} n^{-\beta} = O\left(R^{1-\alpha} n^{-\beta}\right) = o(1).$$

These vertices can generate long edges; we will call such vertices hubs. All other edges are
short. The length of every short edge is wep at most

$$(1 + o(1)) \left(R^{-\alpha} n^{-\beta}\right)^{1/m} = (1 + o(1))\, n^{-\frac{\beta}{(1-\alpha)m}} \log^{\frac{\alpha}{m}} n,$$



since every rank of at least $R$ is well concentrated. (This time the length corresponds to the
torus metric, not the graph distance.)
Since the total volume of the influence regions of the hubs is $o(1)$, there must be a
vertex $v$ and a constant $c \in (0, 1/3)$ so that wep the ball around $v$ with radius $c$ does
not intersect the influence region of any hub. Moreover, since vertices are uniformly
distributed, wep there is a vertex $u$ at (metric) distance greater than $c$ from $v$. Any
path from $v$ to $u$ must use short edges to bridge the distance from $v$ to the boundary of the
ball with radius $c$. This implies that wep the path has (graph distance) length at least

$$(1 + o(1))\, c\, n^{\frac{\beta}{(1-\alpha)m}} \log^{-\frac{\alpha}{m}} n,$$

which finishes the proof. □



Proof of Theorem 3.4 Consider $G$ sampled from GEO-P$(\alpha, \beta, m, p)$, and define
$V_{\min} = n^{-\alpha-\beta}$. In GEO-P$(\alpha, \beta, m, p)$, $V_{\min}$ is the lower bound on
the volume of an influence region, attained by a vertex with rank $n$. Thus, if the
distance between $u$ and $v$ is at most $r_{\min} = V_{\min}^{1/m}/2$, the two vertices $u$ and $v$ are adjacent
with probability $p$. (Of course, edges can also be created between vertices which are a
larger distance apart, but this requires additional conditions on the initial rank of these
vertices.) Recall that, due to our choice of metric, the ball of radius $r_{\min}$ centered at $v$
is a hypercube. We will call this the minimal hypercube of $v$. The minimal hypercube of a
vertex is always a subset of its region of influence.
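The geometric facts used here can be made concrete with a short sketch. It assumes that the metric is the $L_\infty$ distance on the unit torus (consistent with the statements that the largest metric distance is 1/2 and that balls are hypercubes); the function names are hypothetical and chosen only for illustration.

```python
def torus_distance(x, y):
    """L-infinity distance between points x and y on the unit torus."""
    return max(min(abs(a - b), 1 - abs(a - b)) for a, b in zip(x, y))

def minimal_radius(n, alpha, beta, m):
    """Radius r_min of the ball of volume V_min = n^(-alpha-beta) in dimension m."""
    v_min = n ** (-alpha - beta)
    return v_min ** (1 / m) / 2  # a ball of radius r is a hypercube of side 2r

def in_minimal_hypercube(u, v, n, alpha, beta, m):
    """True if u lies in the minimal hypercube centred at v."""
    return torus_distance(u, v) <= minimal_radius(n, alpha, beta, m)
```

For instance, the points $(0.05, 0.5)$ and $(0.95, 0.5)$ are at distance $0.1$ because of the wrap-around, and a ball of radius $r_{\min}$ has volume $(2 r_{\min})^m = V_{\min}$.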

Fix $v \in V(G)$. Without loss of generality, we can assume that $v$ is the origin of the
hypercube $S$; that is, $v = (0, 0, \ldots, 0) \in S$. In order to estimate $c(v)$ from below we
partition the minimal hypercube of $v$ into $K^m$ disjoint, identical, small hypercubes of
volume $\log^3 n/n$ each. The expected number of neighbours of $v$ in every ball is $p\log^3 n$,
so wep the number of neighbours equals $(1 + O(\log^{-1/2} n))\,p\log^3 n$. Note that



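$$K = \left\lfloor \left(\frac{n^{-\alpha-\beta}}{\log^3 n / n}\right)^{1/m} \right\rfloor_2 = \left\lfloor \left(\frac{n^{1-\alpha-\beta}}{\log^3 n}\right)^{1/m} \right\rfloor_2 = \left(\frac{n^{1-\alpha-\beta}}{\log^3 n}\right)^{1/m} (1 + O(1/K)).$$

Recall that $\lfloor x \rfloor_2$ is the largest even integer smaller than or equal to $x$.
We index the balls as follows:

$$b_{i_1, i_2, \ldots, i_m} = (a_{i_1 - 1}, a_{i_1}) \times \cdots \times (a_{i_m - 1}, a_{i_m}),$$

where $a_i = \left(-\frac{K}{2} + i\right)\left(\frac{\log^3 n}{n}\right)^{1/m}$. For all $s \in [m]$, $i_s$ takes values $i_s = 1, 2, \ldots, K$.

Note that every vertex in $b_{i_1, \ldots, i_m}$ falls into the influence region of every vertex in
$b_{j_1, \ldots, j_m}$ if the following condition holds:

(*): For all $s \in [m]$, $|i_s - j_s| \le K/2 - 1$.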
Using this observation, we can derive a lower bound on $e(N(v))$, the number of edges
in the neighbourhood of $v$. We have that wep

$$e(N(v)) \ge \frac{1}{2} \sum_{i_1=1}^{K} \cdots \sum_{i_m=1}^{K} \; \sum_{j_1=1,(*)}^{K} \cdots \sum_{j_m=1,(*)}^{K} e(b_{i_1, \ldots, i_m}, b_{j_1, \ldots, j_m}),$$

where we use the notation $\sum_{j_s=1,(*)}^{K}$ to indicate that the sum is over all $j_s$ between 1 and $K$
such that (*) holds.

By symmetry, the case where $i_s \le K/2$ is symmetrical to the case where
$K - i_s + 1 \le K/2$. So we can bound $e(N(v))$ by considering only the hypercubes $b_{i_1, \ldots, i_m}$
where $i_s \le K/2$ for $1 \le s \le m$, and multiplying the result by $2^m$. Using this symmetry
argument, we have that wep

$$e(N(v)) \ge \frac{2^m}{2} \sum_{i_1=1}^{K/2} \cdots \sum_{i_m=1}^{K/2} \; \sum_{j_1=1}^{K/2-1+i_1} \cdots \sum_{j_m=1}^{K/2-1+i_m} e(b_{i_1, \ldots, i_m}, b_{j_1, \ldots, j_m})$$

$$= (1 + o(1)) \frac{2^m}{2} \sum_{i_1=1}^{K/2} \cdots \sum_{i_m=1}^{K/2} \; \sum_{j_1=1}^{K/2-1+i_1} \cdots \sum_{j_m=1}^{K/2-1+i_m} (p \log^3 n)^2 p.$$

Thus, wep

$$e(N(v)) \ge (1 + o(1)) \frac{2^m}{2} \sum_{i_1=1}^{K/2} \cdots \sum_{i_m=1}^{K/2} \left(\frac{K}{2} - 1 + i_1\right) \cdots \left(\frac{K}{2} - 1 + i_m\right) (p \log^3 n)^2 p$$

$$= (1 + o(1)) \frac{2^m}{2} \left(\frac{3K^2}{8}\right)^m \left(1 - \frac{2}{3K}\right)^m (p \log^3 n)^2 p$$

$$= (1 + o(1)) \exp\left(O\left(\frac{m}{K}\right)\right) \left(\frac{3}{4}\right)^m \frac{(p\, n^{1-\alpha-\beta})^2\, p}{2}. \tag{10}$$

Fix a vertex $v_i \in V(G)$, $i = an$ for some $a \in [0, 1]$. Suppose that the initial rank of
$v_i$ is $R_i = bn$ for some $b \in (0, 1]$. It follows from Theorem 4.4 that wep

$$\deg(v_i, L) = (1 + o(1))\, p \left( \frac{a}{1-\alpha} + b^{-\alpha}(1 - a) \right) n^{1-\alpha-\beta}.$$
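Hence, by (10) wep $c(v_i)$ may be bounded from below as follows:

$$c(v_i) \ge (1 + o(1)) \exp\left(O\left(\frac{m}{K}\right)\right) \left(\frac{3}{4}\right)^m \frac{(p\, n^{1-\alpha-\beta})^2\, p / 2}{\binom{\deg(v_i, L)}{2}}$$

$$= (1 + o(1)) \exp\left(O\left(\frac{m}{K}\right)\right) \left(\frac{3}{4}\right)^m \left( \frac{a}{1-\alpha} + b^{-\alpha}(1 - a) \right)^{-2} p.$$

Since the initial rank is distributed uniformly at random, we derive that the expected
value of $c(v_i)$ can be bounded as follows:

$$\mathbb{E}(c(v_i)) \ge (1 + o(1)) \exp\left(O\left(\frac{m}{K}\right)\right) \left(\frac{3}{4}\right)^m p\, D(a),$$

where

$$D(a) = \int_0^1 \left( \frac{a}{1-\alpha} + b^{-\alpha}(1 - a) \right)^{-2} db.$$

Therefore, we find that wep

$$c(G) \ge (1 + o(1)) \exp\left(O\left(\frac{m}{K}\right)\right) \left(\frac{3}{4}\right)^m p \int_0^1 D(a)\, da = (1 + o(1)) \exp\left(O\left(\frac{m}{K}\right)\right) \left(\frac{3}{4}\right)^m p \int_0^1 \int_0^1 \left( \frac{a}{1-\alpha} + b^{-\alpha}(1 - a) \right)^{-2} db\, da.$$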



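After changing the order of integration, it is straightforward to see that

$$\int_0^1 D(a)\, da = \int_0^1 \int_0^1 \left( \frac{a}{1-\alpha} + b^{-\alpha}(1 - a) \right)^{-2} da\, db$$

$$= \int_0^1 \left[ -\left( \frac{1}{1-\alpha} - b^{-\alpha} \right)^{-1} \left( \frac{a}{1-\alpha} + b^{-\alpha}(1 - a) \right)^{-1} \right]_{a=0}^{a=1} db$$

$$= \int_0^1 (1 - \alpha)\, b^{\alpha}\, db = \frac{1-\alpha}{1+\alpha} \left[ b^{1+\alpha} \right]_{b=0}^{b=1} = \frac{1-\alpha}{1+\alpha},$$

which finishes the proof of the theorem. □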
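The closed form $(1-\alpha)/(1+\alpha)$ for the double integral can be checked numerically. The sketch below is an illustration rather than part of the proof: it uses a plain midpoint rule on a uniform grid, and the function name and the choice $\alpha = 1/2$ are assumptions made for the example.

```python
def clustering_integral(alpha, steps=400):
    """Midpoint-rule estimate of the integral over [0,1]^2 of
    (a/(1-alpha) + b**(-alpha) * (1-a)) ** (-2)."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        a = (i + 0.5) * h
        for j in range(steps):
            b = (j + 0.5) * h  # midpoints avoid the endpoint b = 0
            total += (a / (1 - alpha) + b ** (-alpha) * (1 - a)) ** (-2)
    return total * h * h

alpha = 0.5                          # illustrative value
estimate = clustering_integral(alpha)
exact = (1 - alpha) / (1 + alpha)    # closed form derived above
```

The integrand is bounded by 1 on the unit square, since both $1/(1-\alpha)$ and $b^{-\alpha}$ are at least 1, so the simple quadrature is adequate for a rough comparison.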

This proof shows that, as the dimension $m$ increases, this lower bound on
$e(N(v))$, and thus on the number of triangles, decreases exponentially. This concurs
with our interpretation of the dimension as the minimal number of attributes that
characterizes a user in a social network. It makes sense that assortativity decreases as
the dimension increases: if a user has a friend A with whom she shares interests in one
dimension, and a friend B with whom she connects in another dimension, then the
chances that A and B become friends may not be much larger than dictated by random
chance.
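The exponential decay comes from a per-coordinate factor: among all index pairs $(i_s, j_s) \in [K]^2$ in a single coordinate, roughly three quarters satisfy condition (*), and the $m$ coordinates multiply. The following brute-force count is a small illustration of this; the function names are hypothetical.

```python
def fraction_satisfying_star(K):
    """Fraction of pairs (i, j) in {1..K}^2 with |i - j| <= K/2 - 1, for even K."""
    if K % 2 != 0:
        raise ValueError("K must be even")
    count = sum(
        1
        for i in range(1, K + 1)
        for j in range(1, K + 1)
        if abs(i - j) <= K // 2 - 1
    )
    return count / K ** 2

def lower_bound_factor(K, m):
    """Product of the per-coordinate fractions over m coordinates."""
    return fraction_satisfying_star(K) ** m
```

A direct count gives the fraction $3/4 - 1/(2K)$ in one coordinate, so `lower_bound_factor(K, m)` approaches $(3/4)^m$ as $K$ grows with $m$ fixed.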
Proof of Theorem 3.5. In order to prove Theorem 3.5 we show that there are sparse
cuts in our model. As in the previous subsection, for sets $X$ and $Y$ we use the notation
$e(X, Y)$ for the number of edges with one end in each of $X$ and $Y$. Suppose that the
unit hypercube $S = [0, 1]^m$ is partitioned into two sets of the same volume,

$$S_1 = \{x = (x_1, x_2, \ldots, x_m) \in S : x_1 < 1/2\},$$

and $S_2 = S \setminus S_1$. Both $S_1$ and $S_2$ contain $(1 + o(1))n/2$ vertices wep. In a good expander
(for instance, the binomial random graph $G(n, p)$), wep there would be

$$(1 + o(1)) \frac{|E|}{2} = (1 + o(1)) \frac{p}{4(1-\alpha)} n^{2-\alpha-\beta}$$

edges between $S_1$ and $S_2$. Below we show that this is not the case in our model.

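Theorem 4.5. Let $\alpha \in (0, 1)$, $\beta \in (0, 1 - \alpha)$, $m \in \mathbb{N}$, and $p \in (0, 1]$. Then wep
GEO-P$(\alpha, \beta, m, p)$ has the following property:

(i) if $m = m(n) = o(\log n)$, then $e(S_1, S_2) = o(n^{2-\alpha-\beta})$,

(ii) if $m = m(n) = C \log n$ for some $C > 0$, then

$$e(S_1, S_2) \le (1 + o(1)) \frac{p}{4(1-\alpha)} n^{2-\alpha-\beta} \exp\left(-\frac{\alpha+\beta}{C}\right).$$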

Proof. Let us call a vertex $v$ dangerous if, at some point of the protean process, the
influence region of $v$ has nonempty intersection with both $S_1$ and $S_2$; that is, there
exists $t$ such that $R(v, t) \cap S_1 \ne \emptyset$ and $R(v, t) \cap S_2 \ne \emptyset$. It is easy to see that the older
vertex of every edge between $S_1$ and $S_2$ must be dangerous. We are going to estimate
$e(S_1, S_2)$ by investigating the number of dangerous vertices with the final rank from a
given range. Since the number of edges adjacent to a given dangerous vertex in the cut
can be estimated by its degree, the conclusion can be obtained.

First, let us consider vertices with small ranks; that is, vertices with $r(v, L) <
\sqrt{n}\log^2 n$. Unfortunately, for these vertices we cannot control the behaviour of the
corresponding random variables $r(v, t)$ during the process. However, deterministically
$|R(v, t)| = O(n^{-\beta})$ for all vertices at any point of the process. Since vertices
are distributed in $S$ uniformly at random, the expected number of dangerous vertices
with $r(v, L) < \sqrt{n}\log^2 n$ can be estimated by $O(n^{-\beta/m})\sqrt{n}\log^2 n = O(n^{1/2-\beta/m}\log^2 n)$.
Hence, wep the number of dangerous vertices from this range is

$$O\left(n^{\max\{1/2-\beta/m,\,0\}} \log^2 n\right)$$

by the Chernoff bound. As we already mentioned, it is not possible to estimate the
number of edges adjacent to these vertices. However, by considering the extreme case,
that is, when these vertices have the smallest possible ranks during the whole time, we
obtain that wep the number of edges between these vertices and older ones is at most

$$\sum_{r=1}^{O\left(n^{\max\{1/2-\beta/m,\,0\}} \log^2 n\right)} r^{-\alpha} n^{1-\beta} = O\left(n^{1-\beta+(1-\alpha)\max\{1/2-\beta/m,\,0\}} \log^{2(1-\alpha)} n\right).$$
Now, for a given $i = 1, 2, \ldots, \frac{1}{2}\log_2 n - 2\log_2\log n$, consider vertices with

$$2^{i-1}\sqrt{n}\log^2 n \le r(v, L) < 2^{i}\sqrt{n}\log^2 n. \tag{11}$$

By Lemma 4.2, wep $r(v, t) = (1 + O(\log^{-1/2} n))\,r(v, L)$ (which implies that $|R(v, t)| =
(1 + O(\log^{-1/2} n))\,r(v, L)^{-\alpha} n^{-\beta}$) during its whole life. It follows from the Chernoff
bound that wep the number of dangerous vertices that satisfy (11) is at most

$$O\left((2^i\sqrt{n}\log^2 n)^{1-\alpha/m}\, n^{-\beta/m}\right),$$

and the number of edges adjacent to these vertices can be estimated by

$$O\left((2^i\sqrt{n}\log^2 n)^{1-\alpha/m}\, n^{-\beta/m}\right) \cdot O\left((2^i\sqrt{n}\log^2 n)^{-\alpha}\, n^{1-\beta}\right) = O\left((2^i\sqrt{n}\log^2 n)^{1-\alpha\frac{m+1}{m}}\, n^{1-\beta\frac{m+1}{m}}\right).$$



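Finally, wep the number of edges in the cut that are adjacent to vertices with final
ranks at least $\sqrt{n}\log^2 n$ is

$$\sum_{i=1}^{\frac{1}{2}\log_2 n - 2\log_2\log n} O\left((2^i\sqrt{n}\log^2 n)^{1-\alpha\frac{m+1}{m}}\, n^{1-\beta\frac{m+1}{m}}\right) =
\begin{cases}
O\left(n^{2-\alpha\frac{m+1}{m}-\beta\frac{m+1}{m}}\right), & \text{if } \alpha < \frac{m}{m+1}, \\
O\left(n^{1-\beta\frac{m+1}{m}} \log n\right), & \text{if } \alpha = \frac{m}{m+1}, \\
O\left(n^{\frac{3}{2}-\frac{\alpha}{2}\frac{m+1}{m}-\beta\frac{m+1}{m}} \log^{2-2\alpha\frac{m+1}{m}} n\right), & \text{if } \alpha > \frac{m}{m+1},
\end{cases}$$

which is $o(n^{2-\alpha-\beta})$, provided that $m = o(\log n)$. Hence, item (i) of the theorem follows.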

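For item (ii) in the case $m = C\log n$ for some $C > 0$, we must be more precise and
take care of constants hidden in the $O(\cdot)$ notation. It can be shown that wep

$$e(S_1, S_2) \le (1 + o(1)) \frac{p}{4(1-\alpha)}\, n^{2-\alpha\frac{m+1}{m}-\beta\frac{m+1}{m}}$$

$$= (1 + o(1)) \frac{p}{4(1-\alpha)}\, n^{2-\alpha-\beta} \exp\left(-\frac{\alpha+\beta}{m}\log n\right)$$

$$= (1 + o(1)) \frac{p}{4(1-\alpha)}\, n^{2-\alpha-\beta} \exp\left(-\frac{\alpha+\beta}{C}\right),$$

and the proof is complete. □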
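To see how the dimension controls the sparsity of the cut, one can compare the bound of Theorem 4.5 with the expander value $\frac{p}{4(1-\alpha)} n^{2-\alpha-\beta}$. In the case $\alpha < m/(m+1)$ the ratio of the two is of order $n^{-(\alpha+\beta)/m}$. The sketch below evaluates this ratio; the function name and sample values are illustrative assumptions, and constants hidden in the $O(\cdot)$ notation are ignored.

```python
import math

def cut_ratio(n, alpha, beta, m):
    """Order of e(S_1, S_2) relative to the cut size expected in a good expander,
    namely n^(-(alpha + beta) / m); hypothetical helper for illustration."""
    return n ** (-(alpha + beta) / m)

n, alpha, beta = 10 ** 6, 0.5, 0.3      # illustrative values
ratio_fixed_dimension = cut_ratio(n, alpha, beta, 2)

C = 2.0
ratio_log_dimension = cut_ratio(n, alpha, beta, C * math.log(n))
```

For fixed $m$ the ratio tends to zero as $n$ grows, which is item (i). For $m = C\log n$ it equals the constant $\exp(-(\alpha+\beta)/C)$ from item (ii).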



To finish the proof of Theorem 3.5, we use the expander mixing lemma for the
normalized Laplacian (see [S] for its proof). For sets of nodes $X$ and $Y$ we use the
notation $\mathrm{vol}(X)$ for the volume of the subgraph induced by $X$, $\bar{X}$ for the complement
of $X$, and, as introduced before, $e(X, Y)$ for the number of edges with one end in each
of $X$ and $Y$. (Note that $X \cap Y$ does not have to be empty; in general, $e(X, Y)$ is defined
to be the number of edges between $X \setminus Y$ and $Y$ plus twice the number of edges that
contain only vertices of $X \cap Y$.)



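Lemma 4.6. For all sets $X \subseteq G$,

$$\left| e(X, X) - \frac{(\mathrm{vol}(X))^2}{\mathrm{vol}(G)} \right| \le \lambda\, \frac{\mathrm{vol}(X)\,\mathrm{vol}(\bar{X})}{\mathrm{vol}(G)}.$$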
Proof of Theorem 3.5 It follows from ([T]) and the Chernoff bound that wep

$$\mathrm{vol}(G) = (1 + o(1)) \frac{p}{1-\alpha}\, n^{2-\alpha-\beta},$$

$$\mathrm{vol}(S_1) = (1 + o(1)) \frac{p}{2(1-\alpha)}\, n^{2-\alpha-\beta} = \mathrm{vol}(S_2).$$

Suppose first that $m = o(\log n)$. From Theorem 4.5 we derive that wep

$$e(S_1, S_1) = \mathrm{vol}(S_1) - e(S_1, S_2) = (1 + o(1))\,\mathrm{vol}(S_1) = (1 + o(1)) \frac{p}{2(1-\alpha)}\, n^{2-\alpha-\beta},$$

and Lemma 4.6 implies that wep $\lambda_n \ge 1 + o(1)$. By definition, $\lambda_n \le 1$, so $\lambda_n = 1 + o(1)$.

Suppose now that $m = C\log n$ for some constant $C > 0$. Using Theorem 4.5 one
more time, we find that wep

$$e(S_1, S_1) \ge (1 + o(1)) \frac{p}{2(1-\alpha)}\, n^{2-\alpha-\beta} \left(1 - \frac{\exp\left(-\frac{\alpha+\beta}{C}\right)}{2}\right).$$

As before, the assertion follows directly from Lemma 4.6. □
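The last step can be illustrated numerically. Rearranging the expander mixing lemma for a single set $X$ gives a lower bound on the spectral gap, namely $|e(X,X) - \mathrm{vol}(X)^2/\mathrm{vol}(G)| \cdot \mathrm{vol}(G) / (\mathrm{vol}(X)\,\mathrm{vol}(\bar{X}))$. The sketch below applies this to a balanced cut; the function name and the normalization $\mathrm{vol}(G) = 2$ are assumptions made for the example, and lower-order terms are ignored.

```python
import math

def spectral_gap_lower_bound(e_xx, vol_x, vol_g):
    """Smallest gap value compatible with the expander mixing lemma for a set X,
    given e(X, X), vol(X) and vol(G); hypothetical helper for illustration."""
    vol_complement = vol_g - vol_x
    return abs(e_xx - vol_x ** 2 / vol_g) * vol_g / (vol_x * vol_complement)

# Balanced cut with vol(S_1) = vol(S_2) = 1, so vol(G) = 2.
# Case m = o(log n): the cut is negligible, so e(S_1, S_1) is about vol(S_1).
bound_small_dimension = spectral_gap_lower_bound(1.0, 1.0, 2.0)

# Case m = C log n: e(S_1, S_1) is about vol(S_1) * (1 - x/2), x = exp(-(alpha+beta)/C).
alpha, beta, C = 0.5, 0.3, 2.0          # illustrative values
x = math.exp(-(alpha + beta) / C)
bound_log_dimension = spectral_gap_lower_bound(1.0 - x / 2, 1.0, 2.0)
```

Under these assumptions the first bound equals 1, matching $\lambda_n = 1 + o(1)$, and the second equals $1 - \exp(-(\alpha+\beta)/C)$, which stays bounded away from zero.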



5. Conclusion and Discussion 

We introduced the geo-protean (GEO-P) geometric model for OSNs, and showed 
that with high probability, the model generates graphs satisfying each of the properties 
(i) to (iv) in the introduction. We introduced the dimension of an OSN based on our
model, and examined this new parameter using actual OSN data. We observed that
the dimension of various OSNs ranges from four to seven. It may therefore be possible to
group users via a relatively small number of attributes, although this remains unproven.
The Logarithmic Dimension Hypothesis (or LDH) conjectures that the dimension of an
OSN is best fit by $\log n$, where $n$ is the number of users in the OSN.

The idea of using geometry and dimension to explore OSNs deserves to be more
thoroughly investigated. Given the availability of OSN data, it may be possible to fit
the data to the model to determine the dimension of a given OSN. Initial estimates
from actual OSN data indicate that the spectral gap found in OSNs correlates with the
spectral gap found in the GEO-P model when the dimension is approximately $\log n$,
giving some credence to the LDH. Another interesting direction would be to generalize
the GEO-P model to a wider array of ranking schemes (such as ranking by age or degree),
and determine when similar properties (such as power laws and bad spectral expansion)
provably hold.

We finish by mentioning that recent work [7] indicates that social networks lack 
high compressibility, especially in contrast to the web graph. We propose to study the 
relationship between the GEO-P model and the incompressibility of OSNs in future 
work. 